Question and Initial Research
How can we most accurately predict if someone will have a stroke and what are the most important predictive factors?
Being able to answer this question would provide many benefits. According to the CDC, stroke is one of the leading causes of death in the U.S., and the World Health Organization reports it as one of the leading causes of death worldwide. Strokes are difficult to treat once they occur, so being able to predict, and therefore help prevent, them would be valuable. If high-risk individuals could be identified, they could also be informed of warning signs of stroke to look out for, as stroke survival rates are much higher when emergency treatment begins quickly.
Being able to accurately predict strokes and therefore prevent them would also have economic benefits. According to the CDC, stroke-related costs in the United States were nearly $46 billion between 2014 and 2015, including the cost of health care services, medicines to treat stroke, and missed days of work. Strokes are also one of the leading causes of serious long-term disability. Strokes reduce mobility in more than 50% of stroke survivors over the age of 65.
We found a dataset on Kaggle that includes records of both ischemic and hemorrhagic strokes. The dataset defines a stroke as occurring when a blood vessel that carries oxygen and nutrients to the brain is either blocked by a clot (ischemic) or bursts (hemorrhagic). It includes 11 attributes for each patient, in addition to whether or not they had a stroke. The recorded features are:
- patient identification number
- gender
- age
- presence of hypertension
- presence of heart disease
- marital status
- job type
- residence type
- average glucose level
- BMI
- smoking status
The CDC indicates that high blood pressure, high cholesterol, smoking, obesity, and diabetes are leading causes of stroke. From this data analysis, we would therefore expect to see that smoking status, BMI, hypertension, and heart disease would be the most important factors in predicting stroke.
Exploratory Data Analysis
Exploratory data analysis helps us better understand this dataset before deeper analysis. We want to identify the most important variables and those that are less useful, spot possible outliers and missing values, and understand the relationships between variables. Ultimately, the goal is to maximize our insight into the dataset and minimize the potential for error later in the process.
First Glance
A first look at the data:
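The structure dump below can be reproduced with readr; this is a sketch assuming the Kaggle file is saved as stroke.csv:

```r
library(readr)

# Read the raw Kaggle CSV; note that bmi is parsed as character
# because the file encodes missing values as the literal string "N/A"
stroke <- read_csv("stroke.csv")
str(stroke)
```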
## spec_tbl_df[,12] [5,110 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:5110] 9046 51676 31112 60182 1665 ...
## $ gender : chr [1:5110] "Male" "Female" "Male" "Female" ...
## $ age : num [1:5110] 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : num [1:5110] 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : num [1:5110] 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : chr [1:5110] "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : chr [1:5110] "Private" "Self-employed" "Private" "Private" ...
## $ Residence_type : chr [1:5110] "Urban" "Rural" "Rural" "Urban" ...
## $ avg_glucose_level: num [1:5110] 229 202 106 171 174 ...
## $ bmi : chr [1:5110] "36.6" "N/A" "32.5" "34.4" ...
## $ smoking_status : chr [1:5110] "formerly smoked" "never smoked" "never smoked" "smokes" ...
## $ stroke : num [1:5110] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. gender = col_character(),
## .. age = col_double(),
## .. hypertension = col_double(),
## .. heart_disease = col_double(),
## .. ever_married = col_character(),
## .. work_type = col_character(),
## .. Residence_type = col_character(),
## .. avg_glucose_level = col_double(),
## .. bmi = col_character(),
## .. smoking_status = col_character(),
## .. stroke = col_double()
## .. )
Summary Statistics
Initial summary statistics, after re-classing some of the variables:
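One way to do that re-classing, converting categorical columns to factors and bmi to numeric (a sketch; the exact code is not shown):

```r
# "Unknown" smoking status is treated as missing
stroke$smoking_status[stroke$smoking_status == "Unknown"] <- NA

# Convert categorical columns to factors so summary() reports counts
factor_cols <- c("gender", "hypertension", "heart_disease", "ever_married",
                 "work_type", "Residence_type", "smoking_status", "stroke")
stroke[factor_cols] <- lapply(stroke[factor_cols], as.factor)

# bmi arrives as character because of "N/A" strings;
# coercing to numeric turns those into real NA values
stroke$bmi <- as.numeric(stroke$bmi)
summary(stroke)
```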
## id gender age hypertension heart_disease
## Min. : 67 Female:2994 Min. : 0.08 0:4612 0:4834
## 1st Qu.:17741 Male :2115 1st Qu.:25.00 1: 498 1: 276
## Median :36932 Other : 1 Median :45.00
## Mean :36518 Mean :43.23
## 3rd Qu.:54682 3rd Qu.:61.00
## Max. :72940 Max. :82.00
##
## ever_married work_type Residence_type avg_glucose_level
## No :1757 children : 687 Rural:2514 Min. : 55.12
## Yes:3353 Govt_job : 657 Urban:2596 1st Qu.: 77.25
## Never_worked : 22 Median : 91.89
## Private :2925 Mean :106.15
## Self-employed: 819 3rd Qu.:114.09
## Max. :271.74
##
## bmi smoking_status stroke
## Min. :10.30 formerly smoked: 885 0:4861
## 1st Qu.:23.50 never smoked :1892 1: 249
## Median :28.10 smokes : 789
## Mean :28.89 NA's :1544
## 3rd Qu.:33.10
## Max. :97.60
## NA's :201
The dataset contains data from 5110 patients and is fairly evenly split by gender: 59% female and 41% male. The binary variables hypertension and heart disease are much less balanced, with only 9.75% of patients having hypertension and 5.40% having heart disease. According to the American Heart Association, nearly half of adults in the U.S. have hypertension, or chronically high blood pressure, which is directly related to heart disease. So the dataset may not accurately represent the population, but it can still be used to analyze predictive factors for stroke. The prevalence of stroke in this dataset is 4.87%.
The two variables we are most interested in are not very balanced. On the left is a histogram for hypertension, and on the right is one for heart disease. As mentioned before, only 9.75% of the patients have hypertension and 5.40% of them have heart disease. This could lead to issues in creating predictive models later.
Smoking and BMI could also be of interest.
The average BMI is 28.89, which is classified as overweight. Being overweight or obese increases the risk of heart disease and stroke (though BMI is an imperfect measure on its own). Based on the information from the CDC, we would expect patients who smoke or formerly smoked, or who have higher BMIs, to have a higher incidence of stroke.
Visualizations
Next we wanted to look at some visualizations of numeric variables we think will be important in stroke predictions, other than the important binary variables like hypertension and heart disease.
The BMI vs. stroke incidence chart shows that the average BMI for patients who had a stroke was higher than for those who didn't.
Diabetes is another risk factor for stroke. Although this dataset does not include presence of diabetes as a variable, it does have average glucose level, which can be used to classify whether patients are diabetic. The American Diabetes Association classifies normal blood glucose as less than 100 mg/dl, prediabetes as 100 mg/dl to 125 mg/dl, and diabetes as 126 mg/dl or higher. Looking at a plot of average blood glucose levels by stroke incidence:
This graph doesn't give quite as clear a result as the previous one, mainly because the distribution of the data is unusual: there are many people with roughly normal levels and many with high levels, but fewer in between. The average glucose level was still higher for patients who had a stroke.
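The ADA cutoffs can be encoded directly with cut(); here is a sketch on a small hypothetical vector of glucose values:

```r
# ADA categories (mg/dl): <100 normal, 100-125 prediabetes, >=126 diabetes
avg_glucose <- c(85, 99, 100, 125, 126, 230)  # hypothetical sample values
glucose_class <- cut(avg_glucose,
                     breaks = c(-Inf, 100, 126, Inf),
                     labels = c("normal", "prediabetes", "diabetes"),
                     right = FALSE)  # left-closed: 100 is prediabetes, 126 is diabetes
table(glucose_class)
```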
Another numeric variable that would be interesting to look at is age, as the risk of stroke increases with age.
The scatterplot clearly indicates that patients who had strokes were older, 40+ years on average.
As confirmed by our research, the most at-risk patient would be someone older, with hypertension and heart disease, a high BMI and average glucose level, and who smokes.
Variable Correlation
Something that would be interesting to look at would be the correlation between variables, to see if any of the variables vary in similar ways.
This visualization of the correlation plot confirms that no variables are highly correlated with one another. There are small correlations between age and hypertension, heart disease, high glucose levels, and stroke incidence. Some common comorbid conditions also show smaller correlations, such as heart disease, stroke, hypertension, and average glucose level.
Methods and Evaluation
KNN
Data Cleaning
Values of "N/A" in the bmi column and "Unknown" in the smoking_status column were replaced with NA. All rows containing NA values were then removed from the dataset, and the first column, id, was dropped. Since the data is heavily unbalanced, with only roughly 5% of patients positive for stroke, it was resampled to be balanced: all 180 remaining patients who had a stroke were joined with 180 randomly sampled patients from the original dataset who did not.
library(tibble)  # for as_tibble()
library(caret)   # for createDataPartition()

stroke <- read.csv("stroke.csv")
stroke$bmi <- gsub("N/A", NA, stroke$bmi)
stroke$smoking_status <- gsub("Unknown", NA, stroke$smoking_status)
stroke <- na.omit(stroke)
stroke <- stroke[ -c(1)]
stroke$gender <- as.factor(stroke$gender)
stroke$ever_married <- as.factor(stroke$ever_married)
stroke$work_type <- as.factor(stroke$work_type)
stroke$Residence_type <- as.factor(stroke$Residence_type)
stroke$smoking_status <- as.factor(stroke$smoking_status)
stroke_positive <- subset(stroke, stroke == 1)
stroke_negative <- subset(stroke, stroke == 0)
set.seed(1980)
stroke_sample <- stroke_negative[sample(1:nrow(stroke_negative), size = 180),]
stroke_final <- rbind(stroke_sample, stroke_positive)
stroke <- lapply(stroke_final, function(x) as.numeric(x))
stroke <- as_tibble(stroke)
Baserate for Dataset
The base-rate for the data set is 50%.
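The 50% figure follows from the balanced sample of 180 stroke and 180 non-stroke patients, and can be checked directly:

```r
# Base rate: class balance of the resampled data
table(stroke_final$stroke)      # 180 in each class
mean(stroke_final$stroke == 1)  # 0.5
```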
Creating Test and Training Set
The test and training set was created with an 80/20 partition where 80% of the data set was used for training and 20% was used for testing.
set.seed(1980)
stroke_break <- createDataPartition(stroke$stroke, times = 1, p = 0.8, list = FALSE)
training_stroke <- stroke[stroke_break,]
test_stroke <- stroke[-stroke_break,]
Choosing the best K value for kNN
Using the elbow method, we determined that a K value of 5 was best for our analysis.
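The elbow search itself is not shown; a sketch of how it might look with class::knn, assuming the train/test split from above:

```r
library(class)

set.seed(1980)
# Evaluate test error for a range of K values; the "elbow" is
# where the error stops dropping sharply
feature_cols <- setdiff(names(training_stroke), "stroke")
k_values <- seq(1, 21, by = 2)
errors <- sapply(k_values, function(k) {
  pred <- knn(train = training_stroke[, feature_cols],
              test  = test_stroke[, feature_cols],
              cl    = training_stroke$stroke,
              k     = k)
  mean(pred != test_stroke$stroke)
})
plot(k_values, errors, type = "b", xlab = "K", ylab = "Test error rate")
```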
KNN Model
The confusion matrix below shows the evaluation metrics for the model. The overall accuracy is 77.78%, which is good considering the 50% base rate, and the balanced accuracy is also 77.78%. The false negative rate is 16.67%, meaning 16.67% of people who had a stroke were classified by our model as not at risk. The false positive rate is 27.78%, which is higher, but since our model aims to catch people likely to have a stroke before it happens, a higher false positive rate is preferable to a high false negative rate: it would be far worse to tell an at-risk patient that they have nothing to worry about. The Kappa value of 0.5556 indicates moderate agreement. The log-loss of 2.790974 is poor, meaning the model is not very confident in many of its predictions. The F1-score of 0.7647059 is fairly good, indicating that false negatives and false positives are both reasonably low.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 26 6
## 1 10 30
##
## Accuracy : 0.7778
## 95% CI : (0.6644, 0.8673)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 1.199e-06
##
## Kappa : 0.5556
##
## Mcnemar's Test P-Value : 0.4533
##
## Sensitivity : 0.8333
## Specificity : 0.7222
## Pos Pred Value : 0.7500
## Neg Pred Value : 0.8125
## Prevalence : 0.5000
## Detection Rate : 0.4167
## Detection Prevalence : 0.5556
## Balanced Accuracy : 0.7778
##
## 'Positive' Class : 1
##
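The rates quoted above can be recomputed directly from the confusion matrix counts (positive class = 1, rows = predictions):

```r
# Counts from the confusion matrix above
TP <- 30  # predicted 1, actual 1
TN <- 26  # predicted 0, actual 0
FP <- 10  # predicted 1, actual 0
FN <- 6   # predicted 0, actual 1

accuracy <- (TP + TN) / (TP + TN + FP + FN)           # 0.7778
fnr <- FN / (FN + TP)                                 # 0.1667
fpr <- FP / (FP + TN)                                 # 0.2778
sensitivity <- TP / (TP + FN)                         # 0.8333
specificity <- TN / (TN + FP)                         # 0.7222
balanced_accuracy <- (sensitivity + specificity) / 2  # 0.7778
```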
ROC/AUC Output
The AUC was 0.7727623, which is fairly good, meaning that our model does a good job of distinguishing between patients with and without stroke.
## [[1]]
## [1] 0.7727623
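A sketch of the AUC computation with pROC; knn_pred here is an assumed name for the factor returned by knn(..., prob = TRUE):

```r
library(pROC)

# class::knn stores the winning-class vote share in the "prob"
# attribute, so convert it to the probability of the positive class
prob_pos <- ifelse(knn_pred == 1,
                   attr(knn_pred, "prob"),
                   1 - attr(knn_pred, "prob"))
roc_knn <- roc(response = test_stroke$stroke, predictor = prob_pos)
plot(roc_knn)
auc(roc_knn)
```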
Decision Tree
Data Cleaning
The data was cleaned similarly to the kNN model, with the exception of three added binary columns: bmi_state, glucose_state, and age_state. bmi_state classifies patients with a BMI over 25 as overweight and those at or below 25 as normal weight. glucose_state classifies patients with an average glucose level at or below 125 as normal and those above 125 as diabetic. age_state is based on the median age in the United States, 38.4 years: patients younger than that are classified as young and those older as old. The original bmi, age, and avg_glucose_level columns were then removed, as our tree was built on binary variables only.
library(dplyr)   # for mutate() and %>%
library(rpart)   # for the decision tree

stroke <- read.csv("stroke.csv")
stroke$bmi <- gsub("N/A", NA, stroke$bmi)
stroke$smoking_status <- gsub("Unknown", NA, stroke$smoking_status)
stroke <- na.omit(stroke)
# Data Cleaning Specifically for Decision Tree - Changing things to Binary
stroke$bmi <- as.numeric(stroke$bmi)
stroke <- stroke %>%
mutate(bmi_state = ifelse(bmi <= 25, "normal", "overweight"))
stroke <- stroke %>%
mutate(glucose_state = ifelse(avg_glucose_level <= 125, "normal", "diabetic"))
stroke <- stroke %>%
mutate(age_state = ifelse(age <= 38.4, "young", "old"))
# https://www.statista.com/statistics/241494/median-age-of-the-us-population/
stroke_positive <- subset(stroke, stroke == 1)
stroke_negative <- subset(stroke, stroke == 0)
set.seed(1980)
stroke_sample <- stroke_negative[sample(1:nrow(stroke_negative), size = 180),]
stroke_final <- rbind(stroke_sample, stroke_positive)
stroke_final <- stroke_final[-c(1, 2, 3, 7, 9, 10, 11)]
stroke_final <- lapply(stroke_final, function(x) as.factor(x))
stroke_final <- as_tibble(stroke_final)
Creation of Test Set
The test and training set was created with an 80/20 partition where 80% of the data set was used for training and 20% was used for testing.
set.seed(1980)
stroke_break <- createDataPartition(stroke_final$stroke, times = 1, p = 0.8, list = FALSE)
training_stroke <- stroke_final[stroke_break,]
test_stroke <- stroke_final[-stroke_break,]
Baserate Calculation for Stroke
The baserate calculation is 50% which means that if guessing randomly you have a 50% chance of guessing correctly.
Building the Model
The six variables used by the model were age_state, glucose_state, heart_disease, Residence_type, ever_married, and hypertension. The most important variable was age_state, meaning that age is a very good indicator of whether or not someone will have a stroke; glucose_state was the next best predictor of stroke likelihood. Surprisingly, hypertension was not a very good indicator. Based on the cross-validated relative error, a tree with 6 splits is ideal, since it has the lowest relative error.
set.seed(1950)
tree_stroke = rpart(stroke~.,
method = "class",
parms = list(split = "gini"),
data = training_stroke,
control = rpart.control(cp=0.01))
## age_state glucose_state heart_disease Residence_type ever_married
## 26.3172003 7.7157118 2.3798148 1.6115971 1.2789590
## hypertension
## 0.2859285
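The importance scores and the relative-error table behind the choice of tree size can be inspected with calls like these (a sketch, assuming the tree_stroke fit above):

```r
library(rpart.plot)

tree_stroke$variable.importance  # the scores printed above
printcp(tree_stroke)   # cross-validated relative error by tree size
plotcp(tree_stroke)    # visual elbow for choosing cp
rpart.plot(tree_stroke)  # draw the fitted tree
```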
Confusion Matrix and Prediction Model
The model had an overall accuracy of 61.11% and a balanced accuracy of 61.11%. This is not much better than the 50% base rate, indicating the model performs only slightly better than random chance. The sensitivity of 0.75 corresponds to a false negative rate of 25%. The specificity of 0.4722 corresponds to a false positive rate of 52.78%, which is poor. But as with the kNN model, a high false positive rate is preferable to a high false negative rate in the context of our problem: flagging low-risk patients as at risk is better than telling high-risk patients they are at low risk. The log-loss of 0.54108 is acceptable, meaning the model is fairly confident about its predictions. The F1-score of 0.5483831 is not great, likely due to the high false positive rate.
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 17 9
## 1 19 27
##
## Accuracy : 0.6111
## 95% CI : (0.4889, 0.7238)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.03818
##
## Kappa : 0.2222
##
## Mcnemar's Test P-Value : 0.08897
##
## Sensitivity : 0.7500
## Specificity : 0.4722
## Pos Pred Value : 0.5870
## Neg Pred Value : 0.6538
## Prevalence : 0.5000
## Detection Rate : 0.3750
## Detection Prevalence : 0.6389
## Balanced Accuracy : 0.6111
##
## 'Positive' Class : 1
##
AUC/ROC Curve
The ROC curve shows an AUC of 0.6111, which is fairly poor, meaning the model does not do a good job of distinguishing patients who had a stroke from those who did not.
## Area under the curve: 0.6111
Random Forest
Prep the data
library(dplyr)

# assumes `stroke` holds the full cleaned dataset (rows with NA
# removed), not the balanced 360-row sample built for kNN
stroke = stroke %>%
  select(-id)
# make all parameters factors
stroke_factor = as.data.frame(
apply(
stroke,
2,
function(x) as.factor(x))
)
# the proportion of data used for training
training_split = 0.9
training_rows = sample(
1:nrow(stroke_factor),
dim(stroke_factor)[1]*training_split,
replace=FALSE
)
# split into training and test
stroke_factor_training_data = stroke_factor[training_rows,]
stroke_factor_testing_data = stroke_factor[-training_rows, ]
Build the first random forest
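The call that built this first forest is not shown; a minimal sketch, assuming the randomForest package and its defaults (stroke_rf is an assumed object name):

```r
library(randomForest)

set.seed(1980)  # assumed seed, matching earlier sections
stroke_rf <- randomForest(as.factor(stroke) ~ .,
                          data = stroke_factor_training_data,
                          importance = TRUE)  # needed for the MDA plot
stroke_rf$confusion  # out-of-bag confusion matrix
```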
Confusion matrix
## 0 1 class.error
## 0 2925 0 0
## 1 158 0 1
The random forest model's training accuracy was 94%, but the confusion matrix shows it never predicts a stroke, so this is roughly what simply guessing the majority class ("no stroke") would achieve.
Parameter Importance
The Mean Decrease Accuracy plot shows how much accuracy the model loses by excluding each variable (the top of the plot is most important).
The top five for MDA are: age, BMI, heart disease, marital status, and hypertension, all five scoring 3 or higher.
The Gini plot shows how much each parameter contributes to the homogeneity of the nodes; higher is better.
The top five for MDG are: age, glucose level, BMI, heart disease, and hypertension, with the first three scoring much higher (4 to 6 times).
Visualize random forest results
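The importance plots come from randomForest's built-in helpers (stroke_rf here is an assumed name for the fitted forest):

```r
# Dot charts of Mean Decrease Accuracy and Mean Decrease Gini
varImpPlot(stroke_rf, main = "Variable importance")
importance(stroke_rf)  # the raw scores behind the plots
```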
Optimize Model with Fewer Trees
Build the smaller random forest
Confusion matrix
## 0 1 class.error
## 0 2925 0 0
## 1 158 0 1
The random forest model with only four trees also had an accuracy of 94%, the same as the larger model, and again it never predicts a stroke, so it performs no better than guessing the majority class.
Parameter Importance
The top five for MDA are: age, BMI, gender, heart disease, and hypertension.
What is interesting is that we are now seeing a negative or zero impact from some of the parameters; we will remove these and compare below. The largest parameters to consider later are age, gender, and heart disease.
The top three for MDG are: Glucose Level, Age, and BMI. All other parameters have a score of 0.5 or lower and will not be considered.
Visualize random forest results
Build a random forest with only the most valuable parameters “Lean”
# create leaner testing and training data sets
stroke_factor_training_data_lean = stroke_factor_training_data %>%
select(age, gender, heart_disease, bmi, avg_glucose_level, stroke)
stroke_factor_testing_data_lean = stroke_factor_testing_data %>%
select(age, gender, heart_disease, bmi, avg_glucose_level, stroke)
# create the random forest
stroke_rf_lean = randomForest(
as.factor(stroke)~.,
stroke_factor_training_data_lean,
#y = NULL, #<- A response vector. This is unnecessary because we're specifying a response formula.
#subset = NULL, #<- This is unnecessary because we're using all the rows in the training data set.
#xtest = NULL, #<- This is already defined in the formula by the ".".
#ytest = NULL, #<- This is already defined in the formula by "parent".
ntree = 4, #<- Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets classified at least a few times.
mtry = mytry_tune(stroke_factor_training_data_lean), #<- Number of variables randomly sampled as candidates at each split; mytry_tune() is a helper defined elsewhere in our code. Default for classification is sqrt(# of variables), for regression (# of variables / 3).
replace = TRUE, #<- Should sampled data points be replaced.
#classwt = NULL, #<- Priors of the classes. Use this if you want to specify what proportion of the data SHOULD be in each class. This is relevant if your sample data is not completely representative of the actual population
#strata = NULL, #<- Not necessary for our purpose here.
sampsize = 100, #<- Size of sample to draw each time.
nodesize = 10, #<- Minimum numbers of data points in terminal nodes.
#maxnodes = NULL, #<- Limits the number of maximum splits.
importance = TRUE, #<- Should importance of predictors be assessed?
#localImp = FALSE, #<- Should casewise importance measure be computed? (Setting this to TRUE will override importance.)
proximity = FALSE, #<- Should a proximity measure between rows be calculated?
norm.votes = TRUE, #<- If TRUE (default), the final result of votes are expressed as fractions. If FALSE, raw vote counts are returned (useful for combining results from different runs).
do.trace = TRUE, #<- If set to TRUE, give a more verbose output as randomForest is run.
keep.forest = TRUE, #<- If set to FALSE, the forest will not be retained in the output object. If xtest is given, defaults to FALSE.
keep.inbag = TRUE #<- Should an n by ntree matrix be returned that keeps track of which samples are in-bag in which trees?
)
## ntree OOB 1 2
## 1: 13.57% 10.72% 68.71%
## 2: 8.15% 3.69% 90.51%
## 3: 7.20% 2.84% 87.97%
## 4: 6.00% 1.26% 93.67%
Confusion matrix
## 0 1 class.error
## 0 2888 37 0.01264957
## 1 148 10 0.93670886
This random forest model with a smaller number of parameters gave an accuracy of 93%; the four-tree model and the larger model each gave 94%. All of these are close to what simply guessing the majority class ("no stroke") would achieve, though the lean model does at least classify some positives correctly (10 of the 158 stroke patients).
Parameter Importance
How do the models do on the testing data
Confusion Matrix and Model Statistics
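The test-set confusion matrices below can be produced with caret; a sketch for the lean model (the object names for the large and small forests are not shown, but the pattern is the same):

```r
library(caret)

pred_lean <- predict(stroke_rf_lean,
                     newdata = stroke_factor_testing_data_lean)
confusionMatrix(data = pred_lean,
                reference = stroke_factor_testing_data_lean$stroke,
                positive = "1",
                mode = "everything")  # adds precision/recall/F1
```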
Large Predictions
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 321 22
## 1 0 0
##
## Accuracy : 0.9359
## 95% CI : (0.9045, 0.9594)
## No Information Rate : 0.9359
## P-Value [Acc > NIR] : 0.5564
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 7.562e-06
##
## Sensitivity : 0.00000
## Specificity : 1.00000
## Pos Pred Value : NaN
## Neg Pred Value : 0.93586
## Precision : NA
## Recall : 0.00000
## F1 : NA
## Prevalence : 0.06414
## Detection Rate : 0.00000
## Detection Prevalence : 0.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : 1
##
The kappa of 0 is extremely low; the model never predicts the positive class.
Small Predictions
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 321 22
## 1 0 0
##
## Accuracy : 0.9359
## 95% CI : (0.9045, 0.9594)
## No Information Rate : 0.9359
## P-Value [Acc > NIR] : 0.5564
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 7.562e-06
##
## Sensitivity : 0.00000
## Specificity : 1.00000
## Pos Pred Value : NaN
## Neg Pred Value : 0.93586
## Precision : NA
## Recall : 0.00000
## F1 : NA
## Prevalence : 0.06414
## Detection Rate : 0.00000
## Detection Prevalence : 0.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : 1
##
Lean Predictions
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 320 20
## 1 1 2
##
## Accuracy : 0.9388
## 95% CI : (0.9079, 0.9617)
## No Information Rate : 0.9359
## P-Value [Acc > NIR] : 0.4688
##
## Kappa : 0.1469
##
## Mcnemar's Test P-Value : 8.568e-05
##
## Sensitivity : 0.090909
## Specificity : 0.996885
## Pos Pred Value : 0.666667
## Neg Pred Value : 0.941176
## Precision : 0.666667
## Recall : 0.090909
## F1 : 0.160000
## Prevalence : 0.064140
## Detection Rate : 0.005831
## Detection Prevalence : 0.008746
## Balanced Accuracy : 0.543897
##
## 'Positive' Class : 1
##
The y-axis is ground truth and the x-axis is predictions.
We see that the large and small models never predict that a person has a stroke, and all models make nearly identical predictions.
With this final analysis, we can see that while the model is quite accurate when looking strictly at the numbers, it is unable to identify when a patient may have a stroke, which is the entire purpose of the model.
ROC Curve
Fairness Assessment
Conclusions
Future Work
- The dataset was quite unbalanced, with only about 5% of patients having had a stroke, which limited how well our models could learn to identify stroke cases.
References
https://www.cdc.gov/stroke/facts.htm
https://www.kaggle.com/fedesoriano/stroke-prediction-dataset
https://www.cdc.gov/bloodpressure/facts.htm
https://www.sciencedaily.com/releases/2019/01/190131084238.htm
https://towardsdatascience.com/exploratory-data-analysis-in-r-for-beginners-fe031add7072